Protein - Structure Mapping, Alignments, and Visualization

This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.

Input: Protein ID + amino acid sequence + mutated sequence(s)
Output: Representative protein structure, sequence alignments, and visualization of mutations

Imports

In [1]:
import sys
import logging
In [2]:
# Import the Protein class
from ssbio.core.protein import Protein
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROTEIN_ID will be created
  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

ROOT_DIR
└── PROTEIN_ID
    ├── sequences  # Protein sequence files, alignments, etc.
    └── structures  # Protein structure files, calculations, etc.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type=’mmtf’)[source]

Store information about a protein, which represents the monomeric translated unit of a gene.

The main utilities of this class are to:

  • Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains) protein sequences as SeqProp objects in the sequences attribute
  • Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
  • Set a single representative_sequence and representative_structure
  • Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
  • Provide summaries of alignments and mutations seen
  • Map between residue numbers of sequences and structures
Parameters:
  • ident (str) – Unique identifier for this protein
  • description (str) – Optional description for this protein
  • root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

Todo

  • Implement structural alignment objects with FATCAT
In [7]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')
In [8]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)
Out[8]:
<SeqProp WT at 0x7f5f79cbd908>

Mapping sequence –> structure

Since the sequence has been provided, we just need to BLAST it to the PDB.

Note: These methods do not download any 3D structure files.

Methods

Protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns:

List of new PDBProp objects added to the structures attribute

Return type:

list

In [9]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()
Out[9]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[9]:
pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
pdb_id
2zya A 2319.0 0.0 0.987179 0.963675 451 462
2zya B 2319.0 0.0 0.987179 0.963675 451 462
2zyd A 2319.0 0.0 0.987179 0.963675 451 462
2zyd B 2319.0 0.0 0.987179 0.963675 451 462
2zyg A 2284.0 0.0 0.982906 0.950855 445 460

Downloading and ranking structures

Methods

Protein.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to the protein structures directory.

Parameters:
  • outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Returns:

List of PDB IDs that were downloaded

Return type:

list

Todo

  • Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [10]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)
Out[10]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[10]:
pdb_title description experimental_method mapped_chains resolution chemicals taxonomy_name structure_file
pdb_id
2zya Dimeric 6-phosphogluconate dehydrogenase compl... 6-phosphogluconate dehydrogenase, decarboxylat... X-RAY DIFFRACTION A;B 1.6 6PG Escherichia coli 2zya.mmtf
2zyd Dimeric 6-phosphogluconate dehydrogenase compl... 6-phosphogluconate dehydrogenase, decarboxylat... X-RAY DIFFRACTION A;B 1.5 GLO Escherichia coli 2zyd.mmtf
Protein.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]

Set a representative structure from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.
  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • clean (bool) – If structure should be cleaned
  • keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:

Representative structure from the list of structures. This is a not a map to the original structure, it is copied and optionally cleaned from the original one.

Return type:

StructProp

Todo

  • Remedy large structure representative setting
In [11]:
# Set representative structures
my_protein.set_representative_structure()
Out[11]:
<StructProp REP-2zyd at 0x7f5f79bea1d0>

Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.

In [12]:
my_protein.__dict__
Out[12]:
{'_root_dir': '/tmp',
 'description': None,
 'id': 'SRR1753782_00918',
 'notes': {},
 'pdb_file_type': 'mmtf',
 'representative_chain': 'A',
 'representative_chain_seq_coverage': 95.7,
 'representative_sequence': <SeqProp WT at 0x7f5f79cbd908>,
 'representative_structure': <StructProp REP-2zyd at 0x7f5f79bea1d0>,
 'sequence_alignments': [<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79be7a90>,
  <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79299080>],
 'sequences': [<SeqProp WT at 0x7f5f79cbd908>],
 'structure_alignments': [],
 'structures': [<PDBProp 2zyd at 0x7f5fdb48e080>,
  <PDBProp 2zya at 0x7f5fd538a550>,
  <PDBProp 3fwn at 0x7f5f79cbde48>,
  <PDBProp 2zyg at 0x7f5f79cbdda0>,
  <StructProp REP-2zyd at 0x7f5f79bea1d0>]}

Methods

Protein.load_manual_sequence(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
  • ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
  • write_fasta_file (bool) – If this sequence should be written out to a FASTA file
  • outdir (str) – Path to output directory
  • set_as_representative (bool) – If this sequence should be set as the representative one
  • force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns:

Sequence that was loaded into the sequences attribute

Return type:

SeqProp

In [ ]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)
In [ ]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)
Protein.pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine=’needle’, parse=True, force_rerun=False)[source]

Pairwise all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.

Parameters:
  • gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
In [ ]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()
In [ ]:
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]

# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)
Protein.sequence_mutation_summary(alignment_ids=None, alignment_type=None)[source]

Summarize all mutations found in the sequence_alignments attribute.

Returns 2 dictionaries, single_counter and fingerprint_counter.

single_counter:

Dictionary of {point mutation: list of genes/strains} Example:

{
    ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'],
    ('R', 33, 'T'): ['Strain2']
}

Here, we report which genes/strains have the single point mutation.

fingerprint_counter:

Dictionary of {mutation group: list of genes/strains} Example:

{
    (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'],
    (('A', 24, 'V')): ['Strain1', 'Strain4']
}

Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations

Parameters:
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
Returns:

single_counter, fingerprint_counter

Return type:

dict, dict

In [ ]:
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f

Some additional methods

Getting binding site/other information from UniProt

In [ ]:
import ssbio.databases.uniprot
In [ ]:
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features

Mapping sequence residue numbers to structure residue numbers

Methods

Protein.map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in any SeqProp to the structure’s residue number for a specified chain.

Parameters:
  • resnums (int, list) – Residue numbers in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map to
  • use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:

Mapping of sequence residue numbers to structure residue numbers

Return type:

dict

In [ ]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45],
                                                                       use_representatives=True)
structure_sites

Viewing structures

The awesome package nglview is utilized as a backend for viewing structures within a Jupyter notebook. ssbio view functions will either return a NGLWidget object, which is the same as using nglview like the below example, or act upon the widget object itself.

# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view

Methods

StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • only_chains (str, list) – Chain ID or IDs to display
  • opacity (float) – Opacity of the structure
  • recolor (bool) – If structure should be cleaned and recolored to silver
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

In [ ]:
# View just the structure
view = my_protein.representative_structure.view_structure()
view
Protein.add_mutations_to_nglview(view, alignment_type=’seqalign’, alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color=’red’, unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]

Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
  • grouped (bool) – If groups of mutations should be colored and sized together
  • color (str) – Color of the mutations (overridden if unique_colors=True)
  • unique_colors (bool) – If each mutation/mutation group should be colored uniquely
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
In [ ]:
# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7),
                                    use_representatives=True)
Protein.add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Add select features from the selected SeqProp object to an NGLWidget view object.

Currently parsing for:

  • Single residue features (ie. metal binding sites)
  • Disulfide bonds
Parameters:
  • view (NGLWidget) – NGLWidget view object
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
In [ ]:
# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)

Saving

Protein.save_json(outfile, compression=False)

Save the object as a JSON file using json_tricks

In [ ]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))